meeting of the American Educational Research Association, which was held in New Orleans in
April of 1988. It is being published at this time for archival purposes.

Development Center Independent Research and Independent Exploratory Development (IR/IED)

Summary


Problem 

Conventional methods for scoring aptitude and achievement tests that are used in selecting,
classifying, and training military personnel discard useful information about an examinee's
ability/skill level. Information is lost whenever the original responses to test questions are
classified only as “right” or “wrong.” Additional information can be obtained by considering the
difficulty level of the questions answered correctly and by taking into account which particular
wrong answers were selected.

Objective 

The  objective  of  this  effort  was  to  develop  new  procedures  for  scoring  aptitude  and  achievement 
tests  that  will  increase  the  reliability  and  validity  of  those  tests. 

Approach 

In this research, the authors conducted an empirical evaluation of a new test scoring procedure
(polyweighting; Sympson, 1993) in the context of medical certification testing. Data from 1,100
resident physicians who had completed a 200-item test in the field of otolaryngology (the diagnosis
and treatment of ear, nose, and throat disorders) were obtained. Five hundred of these physicians
were selected at random to make up “Sample A.” Five hundred different physicians were selected at
random to make up “Sample B.” The computer program POLY was applied to the Sample A data in
order to obtain summary statistics and polyweights for all 200 items.

Using the set of 200 items as an item bank, the authors assembled 20 short (10-, 20-, 30-, and
40-item) assessment tests and scored them in Sample B. Twelve assessment tests were assembled by
randomly selecting items and eight assessment tests were assembled by selecting “best” items. Both
proportion-correct scores and test scores based on the Sample A polyweights were computed in
Sample B. Then, internal-consistency reliability coefficients were computed and both types of test
score were correlated with Sample B 200-item domain scores.

Results 

For all 20 assessment tests, polyweighting resulted in higher cross-validated internal-consistency
reliability (coefficient α) and domain validity in Sample B. The observed increases in reliability
corresponded to a mean increase in test length of 28%. Over all 20 tests, the mean increase in domain
validity was .075. The minimum increase in domain validity was .052.

Conclusions 

Results of this study indicate that polyweighting can provide consistent increases in test reliability
and domain-related validity. These findings also suggest that polyweighting should allow test
developers to reduce test length, while maintaining test reliability at the level observed under
traditional number/proportion-correct scoring.

Recommendation

Organizations that administer aptitude and/or achievement tests for purposes of personnel
selection, classification, or training should consider whether the new scoring procedure can be
usefully applied to their tests.

Introduction


Polychotomous scoring of multiple-choice test items is based on the assumption that ability
(knowledge/skill) distributions are not the same for examinees who choose different response
options, even if they have answered the same number of items correctly. If this assumption is
correct, additional information about an examinee’s knowledge/skill level can be obtained by
noting which questions the examinee has answered correctly and which incorrect answers were
selected.

A variety of polychotomous scoring methods have been tried, dating from about 1935 to the
present (Haladyna & Sympson, 1988). These methods can be classified as either linear or
nonlinear. Linear polychotomous scoring involves the use of fixed scoring weights that vary over
response options. Nonlinear polychotomous scoring is based on item response theory (IRT) and
involves the use of likelihood functions (Birnbaum, 1968, p. 455). Since realistic IRT models
require large sample sizes (N > 1000) for item calibration, and since test scoring under these models
usually requires an assumption that the test is unidimensional, nonlinear polychotomous scoring is
less widely applicable than linear polychotomous scoring.

Sympson  (1993)  has  introduced  a  new  method  for  linear  polychotomous  scoring  called 
polyweighting.  The  scores  obtained  with  this  method  are  called  polyscores.  The  purpose  of  this 
study  was  to  compare  polyscores  with  traditional  proportion-correct  scores  in  terms  of  their 
internal-consistency  reliabilities  and  domain  validities.  Comparisons  are  made  in  a  context  similar 
to  that  found  in  certification,  licensing,  proficiency,  or  competency  testing. 

Polyweighting 

The category scoring weights used in polyweighting are called polyweights. An examinee’s
polyscore is equal to the mean of the polyweights for the categories chosen by the examinee. The
iterative procedure used to derive polyweights for a set of items is described in Sympson (1993)
and implemented in the computer program POLY. Polyweights are defined as follows:

1. For each correct answer, the polyweight is equal to the mean percentile rank among
examinees choosing the answer, rounded to the nearest integer.

2. For each wrong answer chosen by 100 or more examinees, the provisional polyweight is
equal to the mean percentile rank among examinees choosing the answer, rounded to the nearest
integer.

3. For each wrong answer chosen by fewer than 100 examinees, the provisional polyweight
is a rounded linear combination of the mean percentile rank among examinees choosing the answer
and the mean percentile rank among examinees choosing any wrong answer on the item. For these
response categories, the polyweight for category j of item i is equal to


rounded to the nearest integer. In Equation 1, R_ij is the mean percentile rank among examinees
choosing category j, N_ij is the number of examinees choosing category j, and the remaining
term is the mean percentile rank among examinees choosing any wrong answer on item i.

4. For a given item, if the provisional polyweight for an incorrect response is less than the
polyweight for the correct response, the provisional polyweight is used as the category polyweight.
However, if the provisional polyweight for an incorrect response equals or exceeds the polyweight
for the correct response, the polyweight for the incorrect response is set equal to 1 less than the
polyweight for the correct response. Thus, under polyweighting, examinees never receive more
credit for an incorrect answer than for a correct answer.
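
The four rules above can be illustrated for a single item with a short Python sketch. It is not the POLY program: it takes the examinee percentile ranks as given (one pass, no iteration), the function and parameter names are hypothetical, and the mixing proportion used for rule 3 (the category count divided by 100) is an assumption made for illustration rather than a form taken from Equation 1.

```python
from collections import defaultdict


def item_polyweights(choices, percentile_ranks, correct, min_n=100):
    """Return {category: integer weight} for a single item.

    choices[k] is the category picked by examinee k and percentile_ranks[k]
    is that examinee's percentile rank. Assumes at least one examinee chose
    the correct answer.
    """
    ranks_by_cat = defaultdict(list)
    for cat, pr in zip(choices, percentile_ranks):
        ranks_by_cat[cat].append(pr)

    def mean(values):
        return sum(values) / len(values)

    # Rule 1: the correct answer gets the rounded mean percentile rank.
    # Note: Python's round() sends exact halves to the nearest even integer.
    weights = {correct: round(mean(ranks_by_cat[correct]))}

    # Percentile ranks of everyone who chose any wrong answer on this item.
    wrong_ranks = [pr for cat, prs in ranks_by_cat.items()
                   if cat != correct for pr in prs]

    for cat, prs in ranks_by_cat.items():
        if cat == correct:
            continue
        if len(prs) >= min_n:
            # Rule 2: enough examinees, use the category mean directly.
            provisional = round(mean(prs))
        else:
            # Rule 3: linear combination of the category mean and the
            # all-wrong-answers mean. The proportion n / min_n is ASSUMED.
            share = len(prs) / min_n
            provisional = round(share * mean(prs)
                                + (1 - share) * mean(wrong_ranks))
        # Rule 4: a wrong answer never earns as much as the correct answer.
        weights[cat] = min(provisional, weights[correct] - 1)
    return weights
```

Because all weights are integers, taking the smaller of the provisional weight and the correct-answer weight minus 1 is equivalent to rule 4 as stated.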

Examinee percentile ranks range from a minimum possible value of 100(1/N) to a maximum
possible value of 100 (where N is the number of examinees in the item calibration sample). Thus,
polyweights can assume any integer value from 0 to 100. Since polyweights are derived from
examinee percentile ranks, and since percentile ranks are independent of the difficulty of the items
administered, polyweights obtained for an item are independent of the difficulty of the other items
administered.

Polyweighting is not based on IRT, and does not require any assumptions regarding “latent”
abilities, the dimensionality of the set(s) of items analyzed, or the mathematical form of the
regression of item responses on unobservable variables. The procedure does assume that the
individuals included in an item analysis are randomly sampled from the examinee population of
interest.

Unlike some scoring methods, polyweighting gives the examinee more credit for correct
answers to difficult questions and less credit for correct answers to easy questions. Also,
polyweighting penalizes the examinee more heavily for wrong answers to easy questions than for
wrong answers to difficult questions. This may be contrasted with number/proportion-correct
scoring and with scoring under the 1-parameter (Rasch) and 2-parameter logistic IRT models. The
latter scoring methods assign scores to examinees in a manner that renders the scores independent
of the difficulty of the questions answered correctly or incorrectly (Birnbaum, 1968, p. 458).

Method 

Data from 1,100 physicians who completed a 200-item test in the field of otolaryngology (the
diagnosis and treatment of ear, nose, and throat disorders) were obtained. Five hundred of these
physicians were selected at random to make up “Sample A.” Five hundred different physicians
were selected at random to make up “Sample B.” The program POLY was then applied to the
Sample A data to obtain item summary statistics and polyweights for all 200 items.

Next,  using  the  set  of  200  items  as  an  item  bank,  20  different  assessment  tests  were  assembled 
and  scored  in  Sample  B.  These  tests  were  as  follows: 

1. Three randomly-selected item-sets of size 10 were designated as tests R10-1, R10-2, and
R10-3. Three samples of items were used in order to obtain an indication of the amount of sampling
variation in reliability and domain validity that could be expected when tests are assembled by
randomly sampling items. Since items in the 200-item test had been allocated to five content
categories by expert (physician) consultants, two items were randomly selected from each content
category, in order to ensure that each test was content valid.

2. In a manner similar to the 10-item tests, three randomly-selected item-sets of size 20 were
designated as tests R20-1, R20-2, and R20-3. Each of these tests included the items from one of the
randomly-assembled 10-item tests. R20-1 included the items making up test R10-1, R20-2
included the items in test R10-2, and R20-3 included the items in test R10-3. In these tests, four
items were randomly selected from each of the five content categories.

3. Three randomly-selected item-sets of size 30 were designated as tests R30-1, R30-2, and
R30-3. Each of these tests included the items from one of the randomly-assembled 20-item tests.
R30-1 included the items making up test R20-1, R30-2 included the items in test R20-2, and R30-3
included the items in test R20-3. In these tests, six items were randomly selected from each of the
five content categories.

4. Three randomly-selected item-sets of size 40 were designated as tests R40-1, R40-2, and
R40-3. Each of these tests included the items from one of the randomly-assembled 30-item tests.
R40-1 included the items making up test R30-1, R40-2 included the items in test R30-2, and R40-3
included the items in test R30-3. In these tests, eight items were randomly selected from each of
the five content categories.

5. Using the results of the Sample A 200-item POLY run, tests of length 10, 20, 30, and 40
items were assembled using “traditional” item selection criteria. In this test construction procedure,
items were selected that had the highest correct-answer point-biserial correlations (Henrysson,
1971, p. 142), subject to a requirement that all item difficulties (proportions correct) had to be
within .10 of the mean item difficulty in the 200-item domain. The resulting tests were designated
as tests T10, T20, T30, and T40. Test T20 included the items making up test T10, test T30 included
the items in test T20, and test T40 included the items in test T30. As before, item selection was
accomplished within the designated content categories, with k items being selected from each
category for a 5k-item test.

6. Using the results of the Sample A 200-item POLY run, tests of length 10, 20, 30, and 40
items were assembled by selecting the items within each content category that had the highest η
coefficients (Lord & Novick, 1968, p. 263). In this context, the squared η coefficient for an item
indicates the proportion of variance in percentile ranks that is accounted for by knowing which
response category each examinee has selected. These four tests were designated as tests EM10,
EM20, EM30, and EM40. Test EM20 included the items making up test EM10, test EM30
included the items in test EM20, and test EM40 included the items in test EM30. As before, k items
were selected from each content category for a 5k-item test.
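
Two ingredients of the assembly procedures above lend themselves to a brief Python sketch: drawing nested random tests within content categories, and computing the squared η coefficient for an item. Both functions are illustrative only. Their names, the seed, and the device of taking prefixes of one shuffled order per category (so that every longer test contains the shorter ones) are assumptions, not a description of how the tests in this study were actually drawn.

```python
import random


def nested_random_tests(items_by_category, per_category=(2, 4, 6, 8), seed=0):
    """Return one test per entry of per_category, each nested in the next.

    items_by_category maps a content category to a list of item ids.
    """
    rng = random.Random(seed)
    # Shuffle each category once; prefixes of the shuffled order are nested.
    order = {cat: rng.sample(items, len(items))
             for cat, items in items_by_category.items()}
    return [[item for cat in sorted(order) for item in order[cat][:k]]
            for k in per_category]


def eta_squared(choices, percentile_ranks):
    """Share of percentile-rank variance explained by the chosen category."""
    n = len(percentile_ranks)
    grand_mean = sum(percentile_ranks) / n
    total_ss = sum((pr - grand_mean) ** 2 for pr in percentile_ranks)

    groups = {}
    for cat, pr in zip(choices, percentile_ranks):
        groups.setdefault(cat, []).append(pr)

    # Between-category sum of squares: group size times squared deviation
    # of the group mean from the grand mean.
    between_ss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    return between_ss / total_ss
```

With five categories and the default step sizes, the four tests returned have 10, 20, 30, and 40 items, mirroring the R10 through R40 design.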

Each  of  the  20  tests  described  above  was  scored  two  different  ways  in  Sample  B.  First,  each 
test  was  scored  by  assigning  a  weight  of  1  to  all  correct-response  categories,  a  weight  of  0  to  all 
incorrect-response  categories,  and  computing  the  mean  weight  among  the  categories  selected.  This 
gave  the  traditional  proportion-correct  (PC)  score.  Next,  each  test  was  scored  using  the 
polyweights  derived  in  Sample  A.  For  each  Sample  B  examinee,  his/her  polyscore  was  the  mean 
Sample  A  polyweight  among  the  categories  selected  by  the  examinee. 
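
The two scoring rules just described differ only in the weights that are averaged. A minimal Python sketch follows; the function names and the example weights are hypothetical and serve only to show the computation.

```python
def proportion_correct(responses, key):
    """Mean of 1/0 weights: 1 for a keyed (correct) response, 0 otherwise."""
    return sum(r == k for r, k in zip(responses, key)) / len(key)


def polyscore(responses, weights):
    """Mean polyweight of the categories the examinee selected.

    weights[i] maps each response category of item i to its polyweight,
    as derived in a separate calibration sample.
    """
    return sum(w[r] for r, w in zip(responses, weights)) / len(weights)


# Hypothetical two-item example (weights invented for illustration).
key = ["A", "C"]
weights = [{"A": 75, "B": 30, "C": 74}, {"A": 40, "B": 35, "C": 62}]
responses = ["A", "B"]
pc = proportion_correct(responses, key)   # one of two items correct
poly = polyscore(responses, weights)      # mean of 75 and 35
```

Two examinees with the same proportion-correct score can receive different polyscores, because the weight attached to each wrong answer depends on which distractor was chosen.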

For each of the 20 tests, Sample B item and test scores were used to compute coefficient α
(Cronbach, 1951) for both PC scoring and for polyweighting. The two resulting values of α for
each test were then used to compute a value of the following relative information index:

H = \frac{\alpha_p (1 - \alpha_d)}{\alpha_d (1 - \alpha_p)}    (2)

This index is based on the Spearman-Brown formula (Lord & Novick, 1968, p. 112). The
Spearman-Brown formula gives the reliability of a lengthened test as a function of the initial
reliability of the test and the proportionate increase in test length that is anticipated. However,
rather than use the Spearman-Brown formula to predict reliability, one can rearrange the formula
and use it to determine how much a given test would have to be increased in length in order to
obtain a specified level of reliability (Nishisato, 1980, p. 118).

In Equation 2, α_d is the value of coefficient α obtained under PC scoring and α_p is the value
of coefficient α obtained under polyweighting. This information index indicates the proportionate
increase in test length that would be required in order to achieve the same reliability under PC
scoring that was achieved using polyweighting.
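
As a check on the reasoning, the Spearman-Brown formula for a test lengthened by a factor H, new reliability = H·α / (1 + (H − 1)·α), can be solved for H. The Python sketch below does this and also gives a textbook computation of coefficient α; the function names are hypothetical, and the code is an illustration rather than the software used in the study.

```python
def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores[e][i] is examinee e's score on item i.

    Under PC scoring the item scores are 1/0; under polyweighting they are
    the polyweights of the chosen categories.
    """
    k = len(item_scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_var_sum = sum(variance([row[i] for row in item_scores])
                       for i in range(k))
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def relative_information(alpha_pc, alpha_poly):
    """Lengthening factor at which a PC-scored test reaches alpha_poly.

    Obtained by solving alpha_poly = H*a / (1 + (H - 1)*a), a = alpha_pc.
    """
    return alpha_poly * (1 - alpha_pc) / (alpha_pc * (1 - alpha_poly))
```

For example, `relative_information(0.252, 0.322)` is approximately 1.41, which agrees with the H value reported in Table 1 for test R10-1.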

Next, for each of the 20 tests, Sample B test scores and Sample B domain scores (based on all
200 items) were used to compute domain validities. For PC scoring, each examinee’s domain score
was the examinee’s proportion correct on the 200-item test. For polyweighting, examinee domain
scores were obtained by running POLY on the Sample B data for all 200 items. It is relevant to note
that under PC scoring the weight (1 or 0) assigned to any given response category was the same
when an item appeared in a short assessment test and when it was part of the domain. On the other
hand, as a result of sampling error, the Sample A polyweight assigned to a response category during
scoring of an assessment test in Sample B was, in general, somewhat different from the weight
assigned to that category during the computation of Sample B domain scores.

Finally, after computing two Sample B domain validities for each test, the difference was
obtained for each test:

D = \rho_p - \rho_d    (3)

where ρ_p is the domain validity under polyweighting and ρ_d is the domain validity under PC
scoring.
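
The domain validities and their difference can be sketched in Python as follows, assuming that a domain validity is the Pearson correlation between short-test scores and 200-item domain scores computed under the same scoring method. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def validity_gain(pc_test, pc_domain, poly_test, poly_domain):
    """Domain validity under polyweighting minus domain validity under PC.

    Each argument is a list of scores for the same examinees, in the same
    order: short-test scores and domain scores under each scoring method.
    """
    return pearson(poly_test, poly_domain) - pearson(pc_test, pc_domain)
```

Each scoring method is correlated with its own version of the domain score, so the difference compares how well each method's short test predicts that method's full-domain result.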

Results and Discussion


Table 1 shows the results of this comparative evaluation of PC scoring and polyweighting.
Inspection of Table 1 shows that for all combinations of test length and test-construction method,
polyweighting outperforms PC scoring in the cross-validation sample.

Table 1

Cross-validated Reliability and Domain Validity of
Proportion-correct Scores and Polyscores for 20 Tests

            Reliability (α)              Domain Validity
            Type of Score                Type of Score
Test        PC       Poly     H          PC       Poly     D

R10-1       .252     .322     1.41       .272     .376     .104
R10-2       .299     .339     1.20       .288     .369     .081
R10-3       .355     .461     1.56       .437     .538     .101
R20-1       .517     .580     1.29       .531     .635     .104
R20-2       .534     .586     1.24       .560     .623     .063
R20-3       .508     .623     1.60       .577     .705     .128
R30-1       .647     .697     1.26       .690     .757     .067
R30-2       .582     .646     1.31       .634     .695     .061
R30-3       .599     .691     1.50       .658     .764     .106
R40-1       .701     .755     1.31       .758     .826     .068
R40-2       .675     .727     1.28       .731     .786     .055
R40-3       .701     .777     1.49       .755     .828     .073
T10         .583     .605     1.10       .597     .664     .067
T20         .720     .740     1.11       .751     .812     .061
T30         .778     .799     1.14       .815     .870     .055
T40         .824     .841     1.13       .847     .899     .052
EM10        .625     .656     1.14       .606     .673     .067
EM20        .738     .766     1.16       .760     .819     .059
EM30        .810     .833     1.17       .815     .881     .066
EM40        .843     .862     1.17       .841     .911     .070

As expected, both coefficient α and domain validity increase as test length increases,
regardless of test-construction method and scoring method. Also, as might be expected, both
coefficient α and domain validity are higher for the systematically-constructed tests than for the
randomly-assembled tests.

For each test length, tests made up of items with maximum η coefficients are more reliable than
tests assembled using the traditional method. However, under PC scoring the “EM” tests are not
always superior to the “T” tests when domain validity is the criterion.

For the randomly-assembled tests (R10-1 through R40-3), the H statistics in column 4 indicate
that, on the average, polyweighting increased coefficient α by an amount that corresponds to a 37%
increase in test length. Smaller increases are observed for the systematically-constructed tests,
where the mean value of H is 1.14. There is an indication that the EM tests benefit slightly more
from polyweighting, since the mean H for these four tests is 1.16, vs. 1.12 for the four T tests.

The  D  statistics  in  column  7  indicate  that,  on  the  average,  polyweighting  increased  domain 
validity  for  the  randomly-assembled  tests  by  .084.  For  the  traditionally-constructed  (T)  tests,  the 
mean  value  of  D  is  .059.  For  the  EM  tests,  the  average  increase  in  domain  validity  is  .066.  Over  all 
20  tests,  the  minimum  increase  in  domain  validity  is  .052. 
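
The group means quoted in the two preceding paragraphs can be recomputed from the reliabilities and validities in Table 1. The Python sketch below does so, deriving H from the α values under the assumption that H is the Spearman-Brown lengthening factor α_poly(1 − α_PC) / (α_PC(1 − α_poly)); the variable and function names are hypothetical.

```python
from statistics import mean

# (test, alpha PC, alpha poly, validity PC, validity poly), from Table 1.
ROWS = [
    ("R10-1", .252, .322, .272, .376), ("R10-2", .299, .339, .288, .369),
    ("R10-3", .355, .461, .437, .538), ("R20-1", .517, .580, .531, .635),
    ("R20-2", .534, .586, .560, .623), ("R20-3", .508, .623, .577, .705),
    ("R30-1", .647, .697, .690, .757), ("R30-2", .582, .646, .634, .695),
    ("R30-3", .599, .691, .658, .764), ("R40-1", .701, .755, .758, .826),
    ("R40-2", .675, .727, .731, .786), ("R40-3", .701, .777, .755, .828),
    ("T10", .583, .605, .597, .664), ("T20", .720, .740, .751, .812),
    ("T30", .778, .799, .815, .870), ("T40", .824, .841, .847, .899),
    ("EM10", .625, .656, .606, .673), ("EM20", .738, .766, .760, .819),
    ("EM30", .810, .833, .815, .881), ("EM40", .843, .862, .841, .911),
]


def h_index(alpha_pc, alpha_poly):
    """Spearman-Brown lengthening factor implied by the two alphas."""
    return alpha_poly * (1 - alpha_pc) / (alpha_pc * (1 - alpha_poly))


def group_of(name):
    """Strip the length and replicate suffix: 'R10-1' -> 'R', 'EM10' -> 'EM'."""
    return name.rstrip("0123456789-")


def group_means(rows):
    """Return (mean H by group, mean D by group)."""
    h, d = {}, {}
    for name, a_pc, a_poly, v_pc, v_poly in rows:
        g = group_of(name)
        h.setdefault(g, []).append(h_index(a_pc, a_poly))
        d.setdefault(g, []).append(v_poly - v_pc)
    return ({g: mean(v) for g, v in h.items()},
            {g: mean(v) for g, v in d.items()})


mean_h, mean_d = group_means(ROWS)
```

To within rounding, `mean_h` and `mean_d` agree with the values reported in the text: about 1.37, 1.12, and 1.16 for H and about .084, .059, and .066 for D in the R, T, and EM groups respectively.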

An important comparison that is implicit in Table 1 can be obtained by contrasting
α-coefficients and domain validities of tests that were assembled using the traditional method and
scored dichotomously with those of tests that were assembled using η-coefficients and scored
polychotomously. This provides a comparison between currently prevailing (dichotomous)
test-construction and scoring practice and an alternative (polychotomous) approach. Comparison of
α-coefficients (.656 vs. .583, .766 vs. .720, .833 vs. .778, and .862 vs. .824) results in a mean H
statistic of 1.35, indicating that a combination of polychotomous item-selection and scoring
provides an increase in reliability that corresponds to a 35% increase in test length. Comparison of
domain validities (.673 vs. .597, .819 vs. .751, etc.) results in a mean D statistic of .069, with a
minimum increase in domain validity of .064.
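
The cross-method comparison above pairs each T test under PC scoring with the EM test of the same length under polyweighting. A short Python sketch of that arithmetic follows, using the Table 1 values and assuming H is the Spearman-Brown lengthening factor; the names are hypothetical.

```python
from statistics import mean

# (alpha, domain validity) for test lengths 10, 20, 30, 40, from Table 1.
T_PC = [(.583, .597), (.720, .751), (.778, .815), (.824, .847)]
EM_POLY = [(.656, .673), (.766, .819), (.833, .881), (.862, .911)]


def h_index(alpha_base, alpha_new):
    """Lengthening factor at which the baseline test reaches alpha_new."""
    return alpha_new * (1 - alpha_base) / (alpha_base * (1 - alpha_new))


# Pair tests of equal length: T under PC scoring vs. EM under polyweighting.
h_values = [h_index(t[0], e[0]) for t, e in zip(T_PC, EM_POLY)]
d_values = [e[1] - t[1] for t, e in zip(T_PC, EM_POLY)]

mean_h = mean(h_values)
mean_d = mean(d_values)
min_d = min(d_values)
```

Rounded to two decimals, `mean_h` is 1.35, and `min_d` is .064 to three decimals, both in agreement with the text; `mean_d` is .0685, which the text reports as .069.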

Conclusion 

Results  of  this  study  indicate  that  polyweighting  can  provide  consistent  increases  in  test 
reliability  and  domain-related  validity.  The  findings  also  suggest  that  polyweighting  should  allow 
test  developers  to  reduce  test  length,  while  maintaining  test  reliability  at  the  level  observed  under 
traditional  number/proportion-correct  scoring. 